{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "import re"
   ]
  },
  {
   "source": [
    "# Ch07 从文本提取信息\n",
    "\n",
    "学习目标\n",
    "\n",
    "1.  从非结构化文本中提取结构化数据\n",
    "2.  识别一个文本中描述的实体和关系\n",
    "3.  使用语料库来训练和评估模型"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 7.5 命名实体识别\n",
    "\n",
    "-   命名实体（Named Entity，NE）：是确切的名词短语，指特定类型的个体。\n",
    "-   命名实体识别（Named Entity Recognition，NER）即识别所有文本中提及的命名实体。\n",
    "    -   主要方法：查词典\n",
    "    -   主要困难：名称有歧义\n",
    "    -   主要手段：基于分类器进行分类\n",
    "    -   两个子任务\n",
    "        1.  确定NE的边界\n",
    "        2.  确定NE的类型"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[('``', '``'), ('We', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('useful', 'JJ'), ('information', 'NN'), ('on', 'IN'), ('whether', 'IN'), ('users', 'NNS'), ('are', 'VBP'), ('at', 'IN'), ('risk', 'NN'), (',', ','), (\"''\", \"''\"), ('said', 'VBD'), ('*T*-1', '-NONE-'), ('James', 'NNP'), ('A.', 'NNP'), ('Talcott', 'NNP'), ('of', 'IN'), ('Boston', 'NNP'), (\"'s\", 'POS'), ('Dana-Farber', 'NNP'), ('Cancer', 'NNP'), ('Institute', 'NNP'), ('.', '.')]\n"
     ]
    }
   ],
   "source": [
    "sent = nltk.corpus.treebank.tagged_sents()[11]\n",
    "print(sent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  ``/``\n  We/PRP\n  have/VBP\n  no/DT\n  useful/JJ\n  information/NN\n  on/IN\n  whether/IN\n  users/NNS\n  are/VBP\n  at/IN\n  risk/NN\n  ,/,\n  ''/''\n  said/VBD\n  *T*-1/-NONE-\n  (NE James/NNP)\n  A./NNP\n  Talcott/NNP\n  of/IN\n  (NE Boston/NNP)\n  's/POS\n  Dana-Farber/NNP\n  Cancer/NNP\n  Institute/NNP\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "# 使用nltk.ne_chunk()函数调用分类器，binary=True表示标注为NE，否则会添加类型标签，例如：PERSON，GPE等等。\n",
    "print(nltk.ne_chunk(sent, binary=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  ``/``\n  We/PRP\n  have/VBP\n  no/DT\n  useful/JJ\n  information/NN\n  on/IN\n  whether/IN\n  users/NNS\n  are/VBP\n  at/IN\n  risk/NN\n  ,/,\n  ''/''\n  said/VBD\n  *T*-1/-NONE-\n  (PERSON James/NNP A./NNP Talcott/NNP)\n  of/IN\n  (GPE Boston/NNP)\n  's/POS\n  Dana-Farber/NNP\n  Cancer/NNP\n  Institute/NNP\n  ./.)\n"
     ]
    }
   ],
   "source": [
    "# NLTK提供的是已经训练好的可以识别命名实体的分类器\n",
    "print(nltk.ne_chunk(sent))"
   ]
  },
  {
   "source": [
    "## 7.6 关系抽取\n",
    "寻找指定类型的命名实体之间的关系"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']\n[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n"
     ]
    }
   ],
   "source": [
    "# 1）寻找所有（X，α，Y）形式的三元组，其中X和Y是指定类型的命名实体，α表示X和Y之间的关系的字符串\n",
    "# 搜索包含词in的字符串\n",
    "IN = re.compile(r'.*\\bin')\n",
    "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n",
    "    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):\n",
    "        print(nltk.sem.rtuple(rel))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']\n[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n"
     ]
    }
   ],
   "source": [
    "# “(?!\\b.+ing)”是一个否定预测先行断言，忽略如“success in supervising the transition of” 这样的字符串\n",
    "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n",
    "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n",
    "    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):\n",
    "        print(nltk.sem.rtuple(rel))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# nltk.sem NLTK （Semantic Interpretation Package）语义解释包\n",
    "# 用于表达一阶逻辑的语义结构和评估集合论模型的公式\n",
    "# This package contains classes for representing semantic structure in\n",
    "# formulas of first-order logic and for evaluating such formulas in\n",
    "# set-theoretic models.\n",
    "from nltk.corpus import conll2002\n",
    "\n",
    "vnv = '''\n",
    "(\n",
    "is/V|\n",
    "was/V|\n",
    "werd/V|\n",
    "wordt/V\n",
    ")\n",
    ".*\n",
    "van/Prep\n",
    "'''\n",
    "VAN = re.compile(vnv, re.VERBOSE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n",
      "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n",
      "VAN('annie_lennox', 'eurythmics')\n"
     ]
    }
   ],
   "source": [
    "# 荷兰语分析工具\n",
    "for doc in conll2002.chunked_sents('ned.train'):\n",
    "    for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):\n",
    "        # 抽取具备特定关系的命名实体\n",
    "        # clause = nltk.sem.clause(rel, relsym='VAN')\n",
    "        print(nltk.sem.clause(rel, relsym='VAN'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "...'')[PER: \"Cornet/V d'Elzius/N\"] 'is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep' [ORG: 'Buitenlandse/N Handel/N'](''...\n",
      "...'')[PER: 'Johan/N Rottiers/N'] 'is/V informaticacoördinator/N van/Prep het/Art' [ORG: 'Kardinaal/N Van/N Roey/N Instituut/N']('in/Prep'...\n",
      "...'Door/Prep rugproblemen/N van/Prep zangeres/N')[PER: 'Annie/N Lennox/N'] 'wordt/V het/Art concert/N van/Prep' [ORG: 'Eurythmics/N']('vandaag/Adv in/Prep'...\n"
     ]
    }
   ],
   "source": [
    "for doc in conll2002.chunked_sents('ned.train'):\n",
    "    for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):\n",
    "        # 抽取具备特定关系的命名实体所在窗口的上下文\n",
    "        # rtuple = nltk.sem.rtuple(rel, lcon=True, rcon=True)\n",
    "        print(nltk.sem.rtuple(rel, lcon = True, rcon = True))"
   ]
  },
  {
   "source": [
    "## 7.7 小结\n",
    "\n",
    "-   信息提取系统搜索大量非结构化文本，寻找特定类型的实体和关系，并将它们用来填充有组织的数据库。\n",
    "    -   这些数据库可以用来寻找特定问题的答案\n",
    "-   信息提取系统的典型结构\n",
    "    -   以断句开始，\n",
    "    -   然后是分词和词性标注。\n",
    "    -   接下来在产生的数据中搜索特定类型的实体。\n",
    "    -   最后，信息提取系统着眼于文本中提到的相互邻近的实体，并试图确定这些实体之间是否有指定的关系\n",
    "-   实体识别通常采用分块器，分割多标识符序列，并且使用适当的实体类型给块加标签。\n",
    "    -   常见的实体类型包括：组织、人员、地点、日期、时间、货币、GPE（地缘政治实体）\n",
    "-   构建分块器的方法\n",
    "    -   利用基于规则的的系统，NLTK中的RegexpParser类；\n",
    "    -   或者使用机器学习技术，NLTK中的ConsecutiveNPChunker类。\n",
    "    -   词性标记是搜索分块时的重要特征\n",
    "-   虽然分块器专门用来建立相对平坦的数据结构，其中任意两个块不允许重叠，但是分块器仍然可以被串联在一起，建立块的嵌套结构\n",
    "-   关系抽取\n",
    "    -   可以使用基于规则的系统查找文本中的联结实体和相关词的特定模式，即满足关系要求的实体；\n",
    "    -   也可以使用基于机器学习的系统从训练语料中自动学习这种特定模式，然后依据模式抽取满足关系要求的实体。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}